## **EECS 370 - Fall20 Final Exam - Solution**

## **Question 1: Short Answer T/F**

Specify whether statements a-j are True or False. (On the exam a random set of 10 questions was chosen)

|    |                                                                                                                                    | True/False |
|----|------------------------------------------------------------------------------------------------------------------------------------|------------|
| a. | There are 32-bit two's complement integers that cannot be represented in a 32-bit IEEE-754 single precision number.                | Т          |
| b. | Deeper pipelines will generally have a shorter clock period.                                                                       | Т          |
| C. | It is possible for a write-back and a write-through cache to perform the same number of writes.                                    | Т          |
| d. | For a given cache size and block size, increasing associativity will reduce cache misses for any program.                          | F          |
| e. | Resolving data hazards with detect and stall and control hazards with speculate and squash will produce the least number of errors | Т          |
| f. | Direction prediction for jalr is worse than beq.                                                                                   | F          |
| g. | Fetch stage can read an instruction from memory when there is a lw/sw in the MEM stage.                                            | F          |
| h. | Physical memory and disk is typically large enough to hold all of a process's virtual memory.                                      | F          |
| i. | DRAM is a non-volatile memory.                                                                                                     | F          |
| j. | You have scrolled through the whole exam to understand point distribution and allocate your time as such.                          | Т          |
| k. | If a program accesses the same cache block repeatedly, it is has                                                                   | Т          |
|    | higher temporal locality                                                                                                           |            |
| l. | Multi-level page tables will always take up less space than single-level page tables.                                              | F          |
| m. | Cache is about the same size as main memory.                                                                                       | F          |

| n. | Dirty bits are needed for write-through caches but not for write-back caches.                                                            | F |
|----|------------------------------------------------------------------------------------------------------------------------------------------|---|
| 0. | Assuming we have an allocate-on-write store policy, an infinitely large cache would have only one miss per cache block.                  | Т |
| p. | The number of block offset bits is equal to $\log_2(\text{size of the cache})$ .                                                         | F |
| q. | For a given cache configuration, a write-through cache may write fewer bytes to memory than a write-back cache running the same program. | Т |
| r. | Virtual memory offers the illusion of infinite memory space.                                                                             | Т |
| S. | A disadvantage of having a deeper pipeline is that there will be more stalls from data and control hazards.                              | Т |
| t. | For a given cache size and block size, increasing associativity will reduce tag size.                                                    | F |

## Question 2: Points \_/8

Assume the following LC2K program is executed on the 5-stage pipeline from lecture until it halts.

| 1  |       | lw    | 0 | 1 | neg1  |
|----|-------|-------|---|---|-------|
| 2  |       | lw    | 0 | 2 | pos1  |
| 3  |       | lw    | 0 | 3 | five  |
| 4  |       | lw    | 0 | 4 | count |
| 5  | loop  | add   | 4 | 2 | 4     |
| 6  |       | add   | 3 | 1 | 3     |
| 7  |       | SW    | 0 | 4 | count |
| 8  |       | beq   | 0 | 3 | end   |
| 9  |       | beq   | 0 | 0 | loop  |
| 10 | end   | halt  |   |   |       |
| 11 | neg1  | .fill |   |   | -1    |
| 12 | pos1  | .fill |   |   | 1     |
| 13 | five  | .fill |   |   | 5     |
| 14 | count | .fill |   |   | 0     |
|    |       |       |   |   |       |

- A. Select all registers that cause a RAW data hazard in a detect and stall pipeline.
  - a. reg0
  - b. reg1
  - c. reg2
  - d. reg3
  - e. reg4
  - f. reg5
  - g. reg6

Lines 4-5 need 2 stalls Lines 5-7 need 1 stall

B. Using detect-and-stall to resolve data hazards, what is the number of stalls that occur due to data hazards?

#### 7 = 2 + 5 \* 1. 2 stalls from lines 4-5. 1 stall for lines 5-7 (Lines 6-8 is NOT a dependency)

C. Using detect-and-forward to resolve data hazards, what is the number of stalls that occur due to data hazards?

1 stall is required to resolve the lw-dependency on lines 4 - 5.

D. Using speculate-and-squash, along with predicting branches always-not-taken to resolve control hazards, what is the number of stalls that occur due to branch mispredictions?

Each misprediction costs 3 cycles. You can calculate the number of mispredictions by looking at how many times each branch is taken.

15 = 5 \* 3. Line 9, beg 0 0, is taken 4 times. Line 8, beg 0 3, is taken 1 time.

#### **Question 3: Pipeline Design (11 Points)**

Consider the LC2K pipeline with the following change:

- Execution (EX) stage split into two (EX1, EX2).
- EX1 calculates the branch destination and performs the equality check of beq.
- EX2 utilizes the ALU for the add/nor/lw/sw instructions. EX2 also updates the PC for branches, if necessary.
- Assume all values are forwarded to EX1.

Control hazards resolved using **speculate and squash**, where branches are predicted to be **not-taken**.

Data hazards resolved using **detect and forward**.

Consider the following LC2K program:

```
lw
            0
                  1
                  2
           0
                        1
2.
     lw
          2
                  2
                        1
3.
     beq
4.
     add
           1
                  1
                        1
                  2
5.
     add
          1
                        1
           2
                  2
6.
     add
7.
     Halt
```

You are a hacker that has somehow managed to take the following snapshot of the pipeline registers while executing the above program **before cycle C**. Your goal is to determine a secret number that is stored in **virtual memory address 0**.

| IF/ID  |      |  |  |
|--------|------|--|--|
| instr: | XXXX |  |  |
| PC+1:  | X    |  |  |
|        |      |  |  |
|        |      |  |  |
|        |      |  |  |

| ID/EX1    |      |  |  |
|-----------|------|--|--|
| instr:    | XXXX |  |  |
| PC+1:     | X    |  |  |
| regA val: |      |  |  |
| regB val: |      |  |  |
| offset:   |      |  |  |

| EX1/EX2       |                |  |
|---------------|----------------|--|
| Instr: (A)    | _ <b>X X X</b> |  |
| PC Plus 1:    | X              |  |
| regA val:     | X              |  |
| regB val:     | X              |  |
| branchTarget: | X              |  |

| EX2/MEM       |    |  |  |
|---------------|----|--|--|
| Instr: (B)    |    |  |  |
| AluResult:    | 36 |  |  |
| regB val: (C) | 1  |  |  |
|               |    |  |  |
|               |    |  |  |

| MEM/WB     |     |  |  |
|------------|-----|--|--|
| Instr: (D) | add |  |  |
| writeData: | 60  |  |  |
|            |     |  |  |
|            |     |  |  |
|            |     |  |  |

| WB/END     |            |  |  |
|------------|------------|--|--|
| instr:     | noop X X X |  |  |
| writeData: | X          |  |  |
|            |            |  |  |
|            |            |  |  |
|            |            |  |  |

List all the data hazards. Specify each data hazard in the following format:

line #X, line #Y, number-of-cycles-stalled

where X, Y are lines in the LC2K program that have a RAW dependency causing the data hazard.

### Line #2, Line #3, 2 cycles

List all the control hazards. Specify each control hazard in the following format:

line #X, number-of-cycles-stalled

where X is a line in the LC2K program that causes the control hazard.

Line #3, 3 cycles

How many instructions does the LC2K program execute? (enter only the number here)

#### 6 instructions

Fill in the missing blanks in the highlighted boxes from the pipeline snapshot above.

EX1/EX2 - (A): halt

EX2/MEM - (B): add 2 2 3

EX2/MEM - (C): 18

MEM/WB - (D): (add) 1 2 1

If we assume that the first instruction of the program is in the fetch stage during Cycle 0, before which cycle (C) was the pipeline snapshot taken? (show your work)

13

2 data hazard stalls + 3 control hazard stalls + 6 instructions + 2 cycles for halt to get to EX1

What is the secret number at memory location 0? (enter only the number here)

42

### **Question 4: Pipeline Performance (6 points)**

Consider a normal 5-stage LC2K pipeline as discussed in class with the following features:

- Using **detect-and-forward** to handle data hazards.
- Using speculate-and-squash to handle control hazards and always predict "Not Taken".
- Branches are resolved in the MEM stage.
- Data memory access (and the critical timing path in the MEM stage) is 10 ns, while the critical path in every other stage is 6 ns

Assume a benchmark with the following characteristics will be run on this pipeline:

- add/nor: 50%beq: 15%lw: 30%sw: 5%
- 40% of all branches are Taken
- 20% of lw instructions are immediately followed by a dependent instruction.
- 10% of lw instructions (disjoint from the previous percentage) are followed by a dependent instruction, but have a single non-dependent instruction between the lw and the dependent instruction.

What is the CPI of this pipeline when running this benchmark?

```
1 + 0.15 * 0.40 * 3 (beg) + 0.30 * 0.20 * 1 (lw) = 1.24 CPI
```

Now, the MEM stage is split into two stages. It reduces the cycle time by splitting the data memory access latency equally between the MEM stages. Branches are resolved in the first MEM stage.

What is the new CPI?

```
1 + 0.15 * 0.40 * 3 (beg) + 0.30 * 0.20 * 1 * 2 (lw imm dep) + 0.30 * 0.10 * 1 (lw) = 1.33 CPI
```

Assuming 10 billion instructions execute, what is the total execution time for both the original (subquestion 1 above) and modified (subquestion 2 above) pipeline? Show your work for credit.

Old execution time: 1.24 \* 10 billion \* 10 ns = 124 s New execution time: 1.33 \* 10 billion \* 6 ns = 79.8 s

## **Question 5: Reverse Engineering the Cache (14 points)**

Assume the following:

8-bit byte-addressable ISA

Cache size: <= 32B (not including overhead).</li>
 Associativity: Fully-associative cache

Cache block size and number of sets are both powers of two.

Cache is initially empty.

Given the following cache outcomes (hit/miss), determine the cache block size and number of cache blocks by answering the following questions.

| Access # | Address | Tag | Cache Hit/<br>Miss |
|----------|---------|-----|--------------------|
| 1        | 0x80    |     | М                  |
| 2        | 0xB4    |     | М                  |
| 3        | 0x81    |     | Н                  |
| 4        | 0xAF    |     | M                  |
| 5        | 0xAC    |     | Н                  |
| 6        | 0xB3    |     | М                  |
| 7        | 0x81    |     | М                  |
| 8        | 0x87    |     | Н                  |
| 9        | 0x75    |     | М                  |
| 10       | 0x79    |     | М                  |

#### 5.1)

Cache **block size** is greater than or equal to (>=) **N** Bytes. Determine maximum value for N and enter **only a number for N** here:

8B - maximum referenced the largest lower bound possible (ie 8 is a larger lower bound than 2)

Your choice of (>=) cache block size above is known because of which access numbers from the table (choose exactly two)?

access # 7 & 8

Cache **block size** is less than (<) **N** Bytes. Determine minimum value for N and enter only a number for N here:

16 B - minimum referenced the smallest upper bound possible (ie 32 is a larger upper bound than 16)

Your choice of (<) cache block size above is known because of which access numbers from the table (choose exactly two)?

access 9 & 10

Therefore, cache block size is **N** Bytes (enter **only a number for N** here): 8B

## 5.2)

Number of cache blocks is greater than or equal to (>=) N # of blocks (enter only a number for N here):

#### 2 blocks

Your choice of (>=) number of cache blocks above is known because of which access numbers from the table? Choose exactly two accesses with the same tag:

access # 1 & 3

Number of cache blocks is less than (<) N # of blocks (enter only a number for N here):

#### 4 blocks

Your choice of (<) number of cache blocks above is known because of which access numbers from the table. Choose exactly two accesses with the same tag: access # 3 & 7 or # 2 & 6 (same tag req.)

Therefore, number of cache blocks is **N # of blocks** (enter **only a number for N** here)?

2

To gain more context of the cache and how to go about this problem see below:

- Access 1 & 2 only have the first two bits the same- telling students that tag bits are at least two bits; however, with the block offset being 6 bits this would exceed the size of the cache specs given
- Access 3 tells us the tag can be at most 7 bits long (block offset being only 1 bit if this is true)

- Access 4 tells students that the tag must be at least 4 bits (if the tag were 3 bits or less there would be a hit given access 2)
- Access 5 tells students that the block offset has to be at least two bits for this to be a hit with access 4 (meaning the tag can be at most 6 bits)
- Access 7 tells us that there cannot be more than two blocks OR that the tag is 6 bits- also this being 0x81 again and being a miss tells students this cache block was evicted
- Access 8 tells students the block offset must be 3 bits for this to be a hit
- Access 9 & 10- this being a miss confirms that the block offset must be 3 bits and that the tag must be 5 bits

### Question 6: 3 C's (2 Points)

Consider a scenario where we increase the cache block size in a set-associative cache. Cache size and associativity is kept unchanged. Choose the answer for each of the following 4 sub-questions.

| following 4 sub-questions.             |  |
|----------------------------------------|--|
| <b>6.1)</b> Number of sets:            |  |
| increases                              |  |
| decreases                              |  |
| remains the same                       |  |
| <b>6.2)</b> Number of cache blocks:    |  |
| increases                              |  |
| decreases                              |  |
| remains the same                       |  |
| <b>6.3)</b> Compulsory misses:         |  |
| increases                              |  |
| decreases                              |  |
| remains the same                       |  |
| <b>6.4)</b> Exploits spatial locality: |  |
| better                                 |  |
| ■ worse                                |  |
| about the same                         |  |

## **Question 7: 3 C's Part II (6 Points)**

Consider a byte-addressable architecture with the following cache:

Cache size: 64 bytes Cache block size: 16 bytes

Associativity: 2-way set-associative

Assume that the cache is initially empty, and uses LRU replacement policy. Lower way is chosen when the set is empty.

Answer the following questions for the list of memory references show in this table. (Empty columns are just workspace).

| Reference | Tag   | Set Index | Hit/Miss | Type (if miss) | Tag of<br>evicted block<br>(if applicable) |
|-----------|-------|-----------|----------|----------------|--------------------------------------------|
| 0x0DD5    | 0x6E  | 0x1       | miss     | compulsory     | -                                          |
| 0x1F07    | 0xF8  | 0x0       | miss     | compulsory     | -                                          |
| 0x02AE    | 0x15  | 0x0       | miss     | compulsory     | -                                          |
| 0x0509    | 0x28  | 0x0       | miss     | compulsory     | 0xF8                                       |
| 0x2EDF    | 0x176 | 0x1       | miss     | compulsory     | -                                          |
| 0x1F0F    | 0xF8  | 0x0       | miss     | conflict       | 0x15                                       |
| 0x1F01    | 0xF8  | 0x0       | hit      | -              | -                                          |
| 0x48D7    | 0x246 | 0x1       | miss     | compulsory     | 0x6E                                       |
| 0x0DDA    | 0x6E  | 0x1       | miss     | capacity       | 0x176                                      |
| 0x0380    | 0x1C  | 0x0       | miss     | compulsory     | 0x28                                       |

**<sup>7.1)</sup>** List all the memory references that incur **conflict misses** in the cache. State each memory reference in the following format:

# ROW, address (0xZZZZ), tag (0xZZZ), set-index (0xZ), tag-of-replaced-block-if-any (0xZZZ)

Each letter "Z" above stands for one hexadecimal digit. (e.g.,: ROW-X, 0x1234, 0x120, 0x1, 0x000)

ROW-F, 0x1F0F, 0xF8, 0x0, 0x15

**7.2)** List all the memory references that incur **capacity misses** in the cache. State each memory reference in the following format:

ROW, address (0xZZZZ), tag (0xZZZ), set-index (0xZ), tag-of-replaced-block-if-any (0xZZZ)

Each letter "Z" above stands for one hexadecimal digit. (e.g.,: ROW-X, 0x1234, 0x120, 0x1, 0x000)

#### ROW-I, 0x0DDA, 0x6E, 0x1, 0x176

**7.3)** At the end of simulating the above address sequence, specify the final state of the 2-way set-associative cache using the tag of the cache blocks stored in the cache.

Way 0, Set 0 Tag: 0x01C

Way 0, Set 1 Tag: 0x246

Way 1, Set 0 Tag: 0x0F8

Way 1, Set 1 Tag: 0x06E

## **Question 8: Hierarchical Page Tables (5 Points)**

Assume 48-bit byte-addressable ISA system that supports virtual memory with the following specifications:

3-level page table

Page size: 4 KB

Physical memory size:
 16 GB

Size of each 3rd level page table: 8 pages

• Size of each 2nd level page table: 1 page

Page table entry size:
 4 bytes

Determine the following:

Page Offset Size # bits (enter only the number here): 12 Bits

Size of physical page number (PPN) # bits (enter only the number here): 22 Bits

Size of 3rd level page table index # bits (enter only the number here): 13 Bits

Size of each 3rd level page table, 2<sup>N</sup> Bytes (enter only the exponent number N here): 2<sup>15</sup> Bytes

Size of 2nd level page table index # bits (enter only the number here): 10 bits

Size of each 2nd level page table 2<sup>N</sup> Bytes (enter only the exponent number N here): 2<sup>12</sup> Bytes

Size of 1st level page table index # bits (enter only the number here): 13 Bits

Size of 1st level page table 2<sup>N</sup> Bytes (enter only the exponent number N here): 32 KB

In the worst case, how much memory would this 3-level page table occupy? State your answer in the form  $2^1 + 2^2 + ... + 2^n$  bytes.  $2^{15} + 2^{25} + 2^{38}$  Bytes

For the same page size, if this system used a single-level page table instead of a three-level page table, how much memory would it occupy in the worst case? State your answer in the form  $2^1 + 2^2 + ... + 2^n$  bytes.  $2^{8}$  Bytes

## **Question 9: Hierarchical Page Tables (6.5 Points)**

Assume 16-bit ISA that uses a 3-level page table. Each virtual memory address is split into following fields:

| 1st level index size: 2nd level index size: 3rd level index size: Physical   4 bits 2 bits | ex size: Physical page offset size: 2 bits |
|--------------------------------------------------------------------------------------------|--------------------------------------------|
|--------------------------------------------------------------------------------------------|--------------------------------------------|

Consider the following virtual addresses accessed in the order given below. Determine the 1st, 2nd and 3rd level indices for each access.

| Virtual Address | 1st level index | 2nd level index | 3rd level index |
|-----------------|-----------------|-----------------|-----------------|
| 0xA7E4          | 1010            | 011111          | 1001            |
| 0xB09B          | 1011            | 000010          | 0110            |
| 0xA78F          | 1010            | 011110          | 0011            |
| 0xC78F          | 1100            | 011110          | 0011            |
| 0xB098          | 1011            | 000010          | 0110            |

What goes in the three empty boxes for Row A, Virtual Address 0xA7E4?

What goes in the three empty boxes for Row B, Virtual Address 0xB09B?

What goes in the three empty boxes for Row C, Virtual Address 0xA78F?

What goes in the three empty boxes for Row D, Virtual Address 0xC78F?

What goes in the three empty boxes for Row E, Virtual Address 0xB098?

How many first-level page tables will have been allocated after these 5 memory accesses? (enter only the number here)

1

How many second-level page tables will have been allocated after these 5 memory accesses? (enter only the number here)

3

How many third-level page tables will have been allocated after these 5 memory accesses? (enter only the number here)

4

## **Question 10: Virtual Memory (11 Points)**

Assume 16-bit ISA and the following system configuration.

#### Cache:

Physically addressed 2-way set associative

Block size: 4 bytes Cache size: 16 bytes

#### Latency for each memory component:

Disk: 1000 ns Physical memory: 50 ns TLB: 2 ns

Cache: 1 ns

## **Memory system:**

Physical memory size: 4KB

Single level page table.

The TLB has 2 entries and is fully associative

#### **Assume the following:**

- The first 8 pages (PPN: 0 7) in the physical memory are empty and free to use.
   And the rest are reserved for the OS. The page table is pinned in the physical memory. It is not cached, except in TLB.
- Pages that are not in memory are on disk

 All updates are in parallel. Upon retrieval from cache, main memory or disk, the data is sent immediately to the CPU, while other updates occur in parallel
 A process accesses the following virtual addresses in order. Latency for each memory access is given.

Based on the access latencies, determine the outcomes of cache and TLB access, and whether it is a page fault or not.

| Access<br>Number | Virtual<br>Address | Latency<br>(ns) | TLB(H/M/N<br>A) | Cache (H/M/NA) | Page Fault<br>(Y/N) |
|------------------|--------------------|-----------------|-----------------|----------------|---------------------|
| 1                | 0x126C             | 1052            | М               | NA             | Y                   |
| 2                | 0x122F             | 1052            | М               | NA             | Υ                   |
| 3                | 0x122B             | 53              | Н               | M              | N                   |
| 4                | 0x126D             | 3               | Н               | Н              | N                   |
| 5                | 0x35AC             | 1052            | М               | NA             | Υ                   |
| 6                | 0x122A             | 53              | М               | Н              | N                   |
| 7                | 0x125B             | 103             | М               | М              | N                   |

Based on the above accesses and their latencies, determine the page size by answering the following questions.

**Page size** is greater than or equal to (>=) 2<sup>N</sup> Bytes. Determine the maximum value for N, and enter *only the exponent number* N here:

Your choice of Page size (>=) is known because of which access numbers from the table (choose exactly two):

because of access # \_\_\_\_1 / 4\_\_\_\_ and access # \_\_\_\_7\_\_

**Page size** is less than or equal to (<=) 2<sup>N</sup> Bytes. Determine the minimum value for N, and enter *only the exponent number* N here:

6

Your choice of Page size (<=) is known because of which access numbers from the table (choose exactly two):

because of access # \_\_\_\_1 / 4 / 7\_\_\_\_ and access # \_\_\_\_2 / 3 / 6\_\_\_\_

Therefore, page size is  $2^{N}$  Bytes (enter *only the exponent number* N here):

## Question 11: Stack-Based ISA (12 Points)

Consider a new Stack-ISA with the following instructions.

Its semantics are defined by the following two ISA-visible components:

- A stack memory
- A special register called COMPARE. It can store a value of 1 or 0.

| Assembly Instruction   | Execution semantics                                                                                                |
|------------------------|--------------------------------------------------------------------------------------------------------------------|
| pushi <u>immediate</u> | stack.push(immediate)                                                                                              |
|                        | (Labels resolve to their addresses)                                                                                |
| push <u>addr</u>       | stack.push(mem[addr])                                                                                              |
| less                   | <pre>first = stack.pop(); second = stack.pop() if (first &lt; second):     COMPARE = 1 else:     COMPARE = 0</pre> |
| greater                | <pre>first = stack.pop(); second = stack.pop() if (first &gt; second):     COMPARE = 1 else:     COMPARE = 0</pre> |
| bcmp                   | <pre>destination = stack.pop() if (COMPARE == 1):     COMPARE = 0     PC = destination</pre>                       |

## **11.1)** Use the instructions in Stack-ISA to implement "Unconditional branch to address 20"

| i 1  |
|------|
|      |
| i O  |
| less |
| bcmp |
|      |
|      |

**11.2)** Use the instructions in Stack-ISA to implement "conditional branch to 20 if mem[5] is equal to mem[8]".

(Hint: you should use part a in your implementation)

```
if (mem[5] == mem[8]) {
                            pushi neq
                                       SOLN 1
                                                  SOLN 2
                                          push 5    push 8
push 8    push 5
   Branch to PC = 20
}
                              less
                                           bcmp
                                _____pushi neq
                                          push 5 push 8
                                           push 8 push 5
                              greater
                                          bcmp
                                           pushi 20
                         eq
                             pushi 1
                              pushi 0
                                            less
                                            bcmp
                         neq noop
```